Local and global topics in text modeling of web pages nested in web sites

Abstract

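Topic models assert that documents are distributions over latent topics and that topics are distributions over words. A nested document collection has documents nested inside a higher order structure, such as articles in journals, podcasts within authors, or web pages in web sites. In a single collection of documents, topics are global, shared across all documents. For web pages nested in web sites, topic frequencies likely vary from site to site and, within a site, almost certainly from page to page. A hierarchical prior for topic frequencies models this structure with a global topic distribution, with site topic distributions varying around the global distribution. Web pages from one United States local health department often contain local geographic and news topics not found on other sites, so some topics are unique to an individual site. Regular topic models that ignore nesting may identify local topics but cannot label those topics as local nor identify the corresponding site owner. Explicitly modeling local topics identifies the web site that owns each local topic. In the US health department web site data, topic coverage is defined at the web site level after removing local topic words from pages. Hierarchical local topic models can be used to study how well health topics are covered.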
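The nesting described in the abstract (a global topic distribution, site distributions varying around it, and page distributions varying around their site) can be illustrated with a small simulation. This is a minimal sketch, not the paper's specification: the Dirichlet parameterization, the concentration parameters `site_conc` and `page_conc`, and the function name are all assumptions made for illustration.

```python
import numpy as np


def sample_nested_topic_distributions(n_topics, n_sites, pages_per_site,
                                      site_conc=50.0, page_conc=10.0, seed=0):
    """Draw global, site-level, and page-level topic distributions.

    Assumed (hypothetical) hierarchy, chosen only for illustration:
      global ~ Dirichlet(1, ..., 1)
      site   ~ Dirichlet(site_conc * global)
      page   ~ Dirichlet(page_conc * site)
    Larger concentration values keep the child closer to its parent.
    """
    rng = np.random.default_rng(seed)
    eps = 1e-6  # keeps every Dirichlet parameter strictly positive

    # One distribution over topics shared by the whole collection.
    global_dist = rng.dirichlet(np.ones(n_topics))

    # Each site's topic frequencies vary around the global distribution.
    site_dists = rng.dirichlet(site_conc * global_dist + eps, size=n_sites)

    # Each page's topic frequencies vary around its own site's distribution.
    page_dists = np.stack([
        rng.dirichlet(page_conc * site + eps, size=pages_per_site)
        for site in site_dists
    ])
    return global_dist, site_dists, page_dists


if __name__ == "__main__":
    g, s, p = sample_nested_topic_distributions(
        n_topics=5, n_sites=3, pages_per_site=4)
    print("global:", np.round(g, 3))
    print("site 0:", np.round(s[0], 3))
    print("page 0 of site 0:", np.round(p[0, 0], 3))
```

The sketch covers only the hierarchical prior on topic frequencies. Local topics owned by a single site, which are the paper's main contribution, are not modeled here.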

Similar resources

Analyzing new features of infected web content in detection of malicious web pages

Recent improvements in web standards and technologies enable the attackers to hide and obfuscate infectious codes with new methods and thus escaping the security filters. In this paper, we study the application of machine learning techniques in detecting malicious web pages. In order to detect malicious web pages, we propose and analyze a novel set of features including HTML, JavaScript (jQuery...

Adaptive Web Sites: Automatically Synthesizing Web Pages

The creation of a complex web site is a thorny problem in user interface design. In IJCAI ’97, we challenged the AI community to address this problem by creating adaptive web sites: sites that automatically improve their organization and presentation by mining visitor access data collected in Web server logs. In this paper we introduce our own approach to this broad challenge. Specifically, we ...

Identifying Corporate Managerial Topics with Web Pages

This paper has as its main aim to analyse how corporate web pages can become an essential tool in order to detect strategic trends by firms or sectors, and even a primary source for benchmarking. This technique has made it possible to identify the key issues in the strategic management of the most excellent large Spanish firms and also to describe trends in their long-range planning, a way of w...

Local Aspects of the Global Ranking of Web Pages

Started in 1998, the search engine Google estimates page importance using several parameters. PageRank is one of those. Precisely, PageRank is a distribution of probability on the Web pages that depends on the Web graph. Our purpose is to show that the PageRank can be decomposed into two terms, internal and external PageRank. These two PageRanks allow a better comprehension of the PageRank sign...

Text Categorization of Commercial Web Pages

In this paper we describe a new on-line document categorization strategy that can be integrated within Web applications. A salient aspect is the use of neural learning in both representation and classification tasks. Within text documents conceived as images, the regions of interest (RoI) containing information meaningful for categorization are identified with the support of a supervised neural...

Journal

Journal title: Computational Statistics & Data Analysis

Year: 2022

ISSN: 0167-9473, 1872-7352

DOI: https://doi.org/10.1016/j.csda.2022.107518